Executive summary

Since the outbreak of the COVID-19 pandemic at the beginning of 2020, the scientific circles have concentrated on studying the structure and characteristics of SARS-CoV-2 virus in order to invent an effective vaccine in the shortest possible time. However, before accomplishing that task it’s crucial to find accurate indicators, which would be helpful in early diagnosis. The aim of the following analysis was to study medical data collected in the early state of pandemic in search of interesting patterns. The first section showed preparation of the environment, which assures the repeatability of the analysis with each run. Next section introduced the original structure of source data, presented basic statistics and walked through the following steps of data processing until obtaining a clean dataset. For better understanding of each variable, the codebook was prepared and can be found in the Data description section. The main part of the analysis begins in the Dataset analysis section. Trends over time indicated that the number of infected patients started growing rapidly around January 27, 2020 and the sooner an infected patient got under medical care, the higher chances of survival he or she had. The majority of people who got to the hospital already in severe condition had died. The elderly and people with preexisting severe health conditions reported to have a higher fatality rate of COVID-19. The blood test results suggested that among available types of blood exams the most reliable biomarkers to detect coronavirus-19 infection were hs-cTnI, NT-proBNP and hs-CRP. Attribute correlation study confirmed that levels of substances tested in standard Blood Biochemistry, Complete Blood Count and Coagulation tests are correlated in each of these categories. The classifier detected that for the presented dataset the crucial tests in outcome of treatment prediction were: LDH level, amounts of neutrophils and lymphocytes, high sensitivity C-reactive protein and albumin protein levels and most of available coagulation tests.

Environment preparation

Setting global options and initialization of the seed variable, which is necessary for ensuring reproducibility.

knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(message = FALSE)

set.seed(23)

Loading the libraries used in the project.

library(knitr)
library(xlsx)
library(dplyr)
library(tidyr)
library(lubridate)
library(arules)
library(caret)
library(DT)
library(kableExtra)
library(plotly)
library(ggplot2)
library(corrplot)
library(codebook)

Dataset processing

The dataset is based on the medical records of 375 patients suffering from COVID-19 collected between 10 January and 18 February 2020 in Tongji Hospital in China. The medical records were gathered by using standard case report forms, which included clinical, epidemiological, demographic, laboratory and mortality outcome information. Data collecting methods and some basic statistics are presented in Tan et al article published in Nature, 14 May 2020.

This section covers loading source data, initial remarks about its content, cleaning phase and detailed dataset description.

Loading source data

The dataset can be downloaded under this link. In this project it’s stored in a local subdirectory to ensure availability, even when it’s deleted from the original source.

Loading source data into a dataframe without any pre clean-up with caching option checked.

path <- '../res/wuhan_blood_sample_data_Jan_Feb_2020.xlsx'
df_raw <- read.xlsx(path, sheetIndex=1, header=TRUE)

Initial remarks and pre-cleanup

Raw dataset counts 6120 rows and 81 columns. Each row has a timestamp when the blood test result was released, which is stored in RE_DATE column. As we notice, values in this column are not unique and there are some missing values. Looking at the arrangement of the rows we can conclude that the dataset is ordered by the increasing value of patient ID (PATIENT_ID column). What’s more, multiple lines are linked to a single patient, but only the first one is properly tagged with patient ID.
rows columns unique RE_DATE NAs in RE_DATE NAs in PATIENT_ID
6120 81 3092 14 5745

Unique keys

In order to get a distinctive key for each line we will combine RE_DATE with PATIENT_ID values. In the first step we will remove rows with NA in RE_DATE, because after further checking we can see that all of the blood test results columns are empty for these lines. Then we will fill the PATIENT_ID column.

all(sapply(df_raw[is.na(df_raw$RE_DATE), c(8:81)], is.na))
## [1] TRUE
df <- df_raw %>% 
  filter(!is.na(RE_DATE)) %>%
  fill(PATIENT_ID)

nrow(df)
## [1] 6106
length(unique(paste(df$PATIENT_ID, df$RE_DATE)))
## [1] 6106

Attribute names and data types

The columns can be divided into 2 categories:

  • information about a patient
  • specific blood test result

Short notes about the first 7 columns, containing information about a patient, are shown below.

## 'data.frame':    6106 obs. of  7 variables:
##  $ PATIENT_ID    : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ RE_DATE       : POSIXct, format: "2020-01-31 01:09:00" "2020-01-31 01:25:00" ...
##  $ age           : num  73 73 73 73 73 73 73 73 73 73 ...
##  $ gender        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ Admission.time: POSIXct, format: "2020-01-30 22:12:47" "2020-01-30 22:12:47" ...
##  $ Discharge.time: POSIXct, format: "2020-02-17 12:40:08" "2020-02-17 12:40:08" ...
##  $ outcome       : num  0 0 0 0 0 0 0 0 0 0 ...

As we can see, the gender and outcome of treatment columns are encoded with numbers. Both of these attributes have only 2 unique values, so we will transform them into factors for more clarity. The information about encoding was found in the article mentioned before, which was used as a reference in this analysis.

df <- df %>%
  mutate(gender = factor(gender, labels = c('male', 'female')),
         outcome = factor(outcome, labels = c('survived', 'deceased')))
The next table presents a summary for the remaining columns, which store results of specific blood tests.
Hypersensitive.cardiac.troponinI hemoglobin Serum.chloride Prothrombin.time procalcitonin eosinophils… Interleukin.2.receptor Alkaline.phosphatase albumin basophil… Interleukin.10 Total.bilirubin Platelet.count monocytes… antithrombin Interleukin.8 indirect.bilirubin Red.blood.cell.distribution.width. neutrophils… total.protein Quantification.of.Treponema.pallidum.antibodies Prothrombin.activity HBsAg mean.corpuscular.volume hematocrit White.blood.cell.count Tumor.necrosis.factorα mean.corpuscular.hemoglobin.concentration fibrinogen Interleukin.1β Urea lymphocyte.count PH.value Red.blood.cell.count Eosinophil.count Corrected.calcium Serum.potassium glucose neutrophils.count Direct.bilirubin Mean.platelet.volume ferritin RBC.distribution.width.SD Thrombin.time X…lymphocyte HCV.antibody.quantification D.D.dimer Total.cholesterol aspartate.aminotransferase Uric.acid HCO3. calcium Amino.terminal.brain.natriuretic.peptide.precursor.NT.proBNP. Lactate.dehydrogenase platelet.large.cell.ratio. Interleukin.6 Fibrin.degradation.products monocytes.count PLT.distribution.width globulin γ.glutamyl.transpeptidase International.standard.ratio basophil.count… X2019.nCoV.nucleic.acid.detection mean.corpuscular.hemoglobin. Activation.of.partial.thromboplastin.time High.sensitivity.C.reactive.protein HIV.antibody.quantification serum.sodium thrombocytocrit ESR glutamic.pyruvic.transaminase eGFR creatinine
Min. : 1.9 Min. : 6.4 Min. : 71.50 Min. : 11.50 Min. : 0.020 Min. :0.000 Min. : 61.0 Min. : 17.00 Min. :13.60 Min. :0.00 Min. : 5.00 Min. : 2.50 Min. : -1.0 Min. : 0.300 Min. : 20.00 Min. : 5.000 Min. : 0.100 Min. :10.60 Min. : 1.7 Min. :31.80 Min. : 0.020 Min. : 6.00 Min. : 0.000 Min. : 61.60 Min. :14.50 Min. : 0.13 Min. : 4.00 Min. :286.0 Min. : 0.500 Min. : 5.00 Min. : 0.800 Min. : 0.000 Min. :5.000 Min. : 0.100 Min. :0.000 Min. :1.650 Min. : 2.760 Min. : 1.000 Min. : 0.06 Min. : 1.600 Min. : 8.50 Min. : 17.8 Min. : 31.30 Min. : 13.00 Min. : 0.000 Min. :0.020 Min. : 0.210 Min. :0.100 Min. : 6.00 Min. : 43.0 Min. : 6.30 Min. :1.170 Min. : 5 Min. : 110.0 Min. :11.20 Min. : 1.500 Min. : 4.00 Min. : 0.010 Min. : 8.00 Min. :10.10 Min. : 3.00 Min. : 0.840 Min. :0.000 Min. :-1 Min. :20.4 Min. : 21.80 Min. : 0.10 Min. :0.05 Min. :115.4 Min. :0.010 Min. : 1.00 Min. : 5.00 Min. : 2.00 Min. : 11.00
1st Qu.: 4.4 1st Qu.:113.0 1st Qu.: 99.05 1st Qu.: 13.60 1st Qu.: 0.040 1st Qu.:0.000 1st Qu.: 459.5 1st Qu.: 54.00 1st Qu.:27.40 1st Qu.:0.10 1st Qu.: 5.00 1st Qu.: 7.40 1st Qu.:109.0 1st Qu.: 2.800 1st Qu.: 74.00 1st Qu.: 8.675 1st Qu.: 3.800 1st Qu.:12.00 1st Qu.:65.1 1st Qu.:61.00 1st Qu.: 0.040 1st Qu.: 65.00 1st Qu.: 0.000 1st Qu.: 86.90 1st Qu.:33.50 1st Qu.: 4.94 1st Qu.: 6.70 1st Qu.:333.0 1st Qu.: 3.050 1st Qu.: 5.00 1st Qu.: 4.000 1st Qu.: 0.460 1st Qu.:6.000 1st Qu.: 3.680 1st Qu.:0.000 1st Qu.:2.270 1st Qu.: 3.950 1st Qu.: 5.550 1st Qu.: 3.09 1st Qu.: 3.225 1st Qu.:10.10 1st Qu.: 377.2 1st Qu.: 38.50 1st Qu.: 15.60 1st Qu.: 3.925 1st Qu.:0.040 1st Qu.: 0.602 1st Qu.:3.010 1st Qu.: 19.50 1st Qu.: 183.2 1st Qu.:21.00 1st Qu.:1.980 1st Qu.: 150 1st Qu.: 218.0 1st Qu.:25.60 1st Qu.: 4.772 1st Qu.: 4.00 1st Qu.: 0.270 1st Qu.:11.10 1st Qu.:29.70 1st Qu.: 22.00 1st Qu.: 1.030 1st Qu.:0.010 1st Qu.:-1 1st Qu.:29.7 1st Qu.: 35.30 1st Qu.: 5.70 1st Qu.:0.07 1st Qu.:137.7 1st Qu.:0.150 1st Qu.: 14.00 1st Qu.: 16.00 1st Qu.: 63.58 1st Qu.: 58.00
Median : 20.6 Median :125.0 Median :102.10 Median : 14.80 Median : 0.100 Median :0.100 Median : 676.5 Median : 69.50 Median :32.20 Median :0.20 Median : 5.90 Median : 10.70 Median :178.0 Median : 5.700 Median : 86.00 Median : 16.000 Median : 5.400 Median :12.60 Median :82.4 Median :65.90 Median : 0.050 Median : 81.00 Median : 0.010 Median : 90.10 Median :36.60 Median : 7.72 Median : 8.60 Median :343.0 Median : 4.120 Median : 5.00 Median : 5.985 Median : 0.800 Median :6.500 Median : 4.140 Median :0.010 Median :2.360 Median : 4.410 Median : 6.990 Median : 5.85 Median : 4.800 Median :10.80 Median : 711.0 Median : 40.90 Median : 16.80 Median :11.450 Median :0.060 Median : 2.155 Median :3.630 Median : 27.00 Median : 243.7 Median :23.50 Median :2.080 Median : 585 Median : 340.0 Median :30.90 Median : 19.265 Median : 17.90 Median : 0.410 Median :12.40 Median :32.70 Median : 34.00 Median : 1.140 Median :0.010 Median :-1 Median :30.9 Median : 39.20 Median : 51.50 Median :0.09 Median :140.4 Median :0.210 Median : 28.00 Median : 24.00 Median : 87.90 Median : 76.00
Mean : 1223.2 Mean :123.1 Mean :103.14 Mean : 16.68 Mean : 1.107 Mean :0.629 Mean : 907.2 Mean : 82.47 Mean :32.01 Mean :0.21 Mean : 16.07 Mean : 16.70 Mean :184.3 Mean : 6.155 Mean : 85.32 Mean : 83.088 Mean : 6.889 Mean :13.07 Mean :77.6 Mean :65.30 Mean : 0.132 Mean : 78.55 Mean : 8.306 Mean : 90.39 Mean :36.55 Mean : 15.60 Mean : 11.58 Mean :342.8 Mean : 4.294 Mean : 6.51 Mean : 9.589 Mean : 1.017 Mean :6.484 Mean : 9.288 Mean :0.039 Mean :2.355 Mean : 4.509 Mean : 8.889 Mean : 7.81 Mean : 9.887 Mean :10.91 Mean : 1379.1 Mean : 42.44 Mean : 18.17 Mean :15.392 Mean :0.117 Mean : 7.943 Mean :3.689 Mean : 46.53 Mean : 276.1 Mean :23.14 Mean :2.078 Mean : 3669 Mean : 474.2 Mean :31.77 Mean : 112.308 Mean : 61.35 Mean : 0.526 Mean :13.01 Mean :33.24 Mean : 55.34 Mean : 1.313 Mean :0.017 Mean :-1 Mean :31.0 Mean : 41.52 Mean : 76.24 Mean :0.10 Mean :141.6 Mean :0.212 Mean : 33.69 Mean : 38.86 Mean : 81.56 Mean : 109.93
3rd Qu.: 223.8 3rd Qu.:137.0 3rd Qu.:105.65 3rd Qu.: 16.70 3rd Qu.: 0.405 3rd Qu.:0.800 3rd Qu.:1155.5 3rd Qu.: 95.00 3rd Qu.:36.60 3rd Qu.:0.30 3rd Qu.: 12.35 3rd Qu.: 16.77 3rd Qu.:248.0 3rd Qu.: 8.600 3rd Qu.: 97.00 3rd Qu.: 35.200 3rd Qu.: 8.000 3rd Qu.:13.70 3rd Qu.:92.3 3rd Qu.:70.45 3rd Qu.: 0.070 3rd Qu.: 95.00 3rd Qu.: 0.010 3rd Qu.: 93.90 3rd Qu.:39.90 3rd Qu.: 12.72 3rd Qu.: 11.50 3rd Qu.:350.0 3rd Qu.: 5.480 3rd Qu.: 5.00 3rd Qu.:11.400 3rd Qu.: 1.310 3rd Qu.:7.294 3rd Qu.: 4.650 3rd Qu.:0.060 3rd Qu.:2.440 3rd Qu.: 4.870 3rd Qu.:10.260 3rd Qu.:10.95 3rd Qu.: 8.275 3rd Qu.:11.50 3rd Qu.: 1425.2 3rd Qu.: 44.70 3rd Qu.: 18.38 3rd Qu.:24.975 3rd Qu.:0.090 3rd Qu.:21.000 3rd Qu.:4.265 3rd Qu.: 42.00 3rd Qu.: 333.8 3rd Qu.:25.90 3rd Qu.:2.190 3rd Qu.: 2625 3rd Qu.: 601.8 3rd Qu.:37.20 3rd Qu.: 60.167 3rd Qu.:150.00 3rd Qu.: 0.580 3rd Qu.:14.30 3rd Qu.:36.50 3rd Qu.: 58.00 3rd Qu.: 1.330 3rd Qu.:0.020 3rd Qu.:-1 3rd Qu.:32.2 3rd Qu.: 44.12 3rd Qu.:118.50 3rd Qu.:0.11 3rd Qu.:143.5 3rd Qu.:0.270 3rd Qu.: 45.50 3rd Qu.: 41.00 3rd Qu.:103.97 3rd Qu.: 98.25
Max. :50000.0 Max. :178.0 Max. :140.40 Max. :120.00 Max. :57.170 Max. :8.600 Max. :7500.0 Max. :620.00 Max. :48.60 Max. :1.70 Max. :1000.00 Max. :505.70 Max. :558.0 Max. :53.000 Max. :136.00 Max. :6795.000 Max. :145.100 Max. :27.10 Max. :98.9 Max. :88.70 Max. :11.950 Max. :142.00 Max. :250.000 Max. :118.90 Max. :52.30 Max. :1726.60 Max. :168.00 Max. :514.0 Max. :10.780 Max. :88.50 Max. :68.400 Max. :52.420 Max. :7.565 Max. :749.500 Max. :0.490 Max. :2.790 Max. :12.800 Max. :43.010 Max. :33.88 Max. :360.600 Max. :15.00 Max. :50000.0 Max. :113.30 Max. :161.90 Max. :60.000 Max. :2.090 Max. :60.000 Max. :7.300 Max. :1858.00 Max. :1176.0 Max. :36.30 Max. :2.620 Max. :70000 Max. :1867.0 Max. :62.20 Max. :5000.000 Max. :190.80 Max. :39.920 Max. :25.30 Max. :50.60 Max. :732.00 Max. :13.480 Max. :0.120 Max. :-1 Max. :50.8 Max. :144.00 Max. :320.00 Max. :0.27 Max. :179.7 Max. :0.510 Max. :110.00 Max. :1600.00 Max. :224.00 Max. :1497.00
NA’s :5599 NA’s :5131 NA’s :5131 NA’s :5444 NA’s :5647 NA’s :5149 NA’s :5838 NA’s :5176 NA’s :5172 NA’s :5149 NA’s :5839 NA’s :5176 NA’s :5149 NA’s :5148 NA’s :5776 NA’s :5838 NA’s :5200 NA’s :5183 NA’s :5149 NA’s :5175 NA’s :5827 NA’s :5447 NA’s :5827 NA’s :5149 NA’s :5149 NA’s :4979 NA’s :5838 NA’s :5149 NA’s :5540 NA’s :5838 NA’s :5170 NA’s :5149 NA’s :5722 NA’s :4979 NA’s :5149 NA’s :5192 NA’s :5126 NA’s :5331 NA’s :5149 NA’s :5176 NA’s :5244 NA’s :5823 NA’s :5183 NA’s :5540 NA’s :5148 NA’s :5827 NA’s :5476 NA’s :5175 NA’s :5171 NA’s :5172 NA’s :5172 NA’s :5127 NA’s :5631 NA’s :5172 NA’s :5244 NA’s :5834 NA’s :5776 NA’s :5149 NA’s :5244 NA’s :5176 NA’s :5176 NA’s :5447 NA’s :5149 NA’s :5605 NA’s :5149 NA’s :5538 NA’s :5369 NA’s :5828 NA’s :5131 NA’s :5244 NA’s :5723 NA’s :5175 NA’s :5170 NA’s :5170

The author of the original dataset decided to label columns with full-length names of the blood tests, which are too long and incomprehensible for our purposes. Also in some of names occur special symbols, which were degenerated during data loading. To obtain easy to work with column names the following steps were performed:

  • replacing too long names with its abbreviations used in medical jargon (if one exists) or made-up short forms
  • standardizing separator symbol to ‘.’
  • replacing string ‘…’, which is a remaining of deformed ‘(%)’ from loading data, with ‘.percent’
  • removing preceding and trailing redundant separator symbol occurrences
  • converting multi-word names to lower case with an exception to names changed by hand
df <- df %>%
  rename("hs.cTnI" = "Hypersensitive.cardiac.troponinI") %>%
  rename("PT" = "Prothrombin.time") %>%
  rename("IL.2.receptor" = "Interleukin.2.receptor") %>%
  rename("ALP" = "Alkaline.phosphatase") %>%
  rename("IL.10" = "Interleukin.10") %>%
  rename("AT" = "antithrombin") %>%
  rename("IL.8" = "Interleukin.8") %>%
  rename("RDW" = "Red.blood.cell.distribution.width.") %>%
  rename("q.anti.TP" = "Quantification.of.Treponema.pallidum.antibodies") %>%
  rename("MCV" = "mean.corpuscular.volume") %>%
  rename("WBC.count" = "White.blood.cell.count") %>%
  rename("TNF.alpha" = "Tumor.necrosis.factorα") %>%
  rename("MCHC" = "mean.corpuscular.hemoglobin.concentration") %>%
  rename("IL.1.beta" = "Interleukin.1β") %>%
  rename("RBC.count" = "Red.blood.cell.count") %>%
  rename("MPV" = "Mean.platelet.volume") %>%
  rename("RBC.DW.SD" = "RBC.distribution.width.SD") %>%
  rename("TT" = "Thrombin.time") %>%
  rename("q.anti.HCV" = "HCV.antibody.quantification") %>%
  rename("D.Dimer" = "D.D.dimer") %>%
  rename("AST" = "aspartate.aminotransferase") %>%
  rename("NT.proBNP" = "Amino.terminal.brain.natriuretic.peptide.precursor.NT.proBNP.") %>%
  rename("LDH" = "Lactate.dehydrogenase") %>%
  rename("P.LCR" = "platelet.large.cell.ratio.") %>%
  rename("IL.6" = "Interleukin.6") %>%
  rename("FDP" = "Fibrin.degradation.products") %>%
  rename("PDW" = "PLT.distribution.width") %>%
  rename("GGT" = "γ.glutamyl.transpeptidase") %>%
  rename("INR" = "International.standard.ratio") %>%
  rename("X2019.nCoV.NAT" = "X2019.nCoV.nucleic.acid.detection") %>%
  rename("MCH" = "mean.corpuscular.hemoglobin.") %>%
  rename("aPTT" = "Activation.of.partial.thromboplastin.time") %>%
  rename("hs.CRP" = "High.sensitivity.C.reactive.protein") %>%
  rename("q.anti.HIV" = "HIV.antibody.quantification") %>%
  rename("ALAT" = "glutamic.pyruvic.transaminase") %>%
  rename_with(~gsub('_', '.', .)) %>%
  rename_with(~gsub('\\.{3}$', '.percent', .)) %>%
  rename_with(~gsub('^X\\.{3}|\\.$', '', .)) %>%
  rename_with(tolower, matches("^[[:upper:]]{1}[[:lower:]]+|^[[:upper:]]{2,}\\.{1}[[:upper:]]{2,}$", ignore.case = FALSE))

Obtained names are presented below.

##  [1] "patient.id"             "re.date"                "age"                   
##  [4] "gender"                 "admission.time"         "discharge.time"        
##  [7] "outcome"                "hs.cTnI"                "hemoglobin"            
## [10] "serum.chloride"         "PT"                     "procalcitonin"         
## [13] "eosinophils.percent"    "IL.2.receptor"          "ALP"                   
## [16] "albumin"                "basophil.percent"       "IL.10"                 
## [19] "total.bilirubin"        "platelet.count"         "monocytes.percent"     
## [22] "AT"                     "IL.8"                   "indirect.bilirubin"    
## [25] "RDW"                    "neutrophils.percent"    "total.protein"         
## [28] "q.anti.TP"              "prothrombin.activity"   "HBsAg"                 
## [31] "MCV"                    "hematocrit"             "WBC.count"             
## [34] "TNF.alpha"              "MCHC"                   "fibrinogen"            
## [37] "IL.1.beta"              "urea"                   "lymphocyte.count"      
## [40] "PH.value"               "RBC.count"              "eosinophil.count"      
## [43] "corrected.calcium"      "serum.potassium"        "glucose"               
## [46] "neutrophils.count"      "direct.bilirubin"       "MPV"                   
## [49] "ferritin"               "RBC.DW.SD"              "TT"                    
## [52] "lymphocyte"             "q.anti.HCV"             "D.Dimer"               
## [55] "total.cholesterol"      "AST"                    "uric.acid"             
## [58] "HCO3"                   "calcium"                "NT.proBNP"             
## [61] "LDH"                    "P.LCR"                  "IL.6"                  
## [64] "FDP"                    "monocytes.count"        "PDW"                   
## [67] "globulin"               "GGT"                    "INR"                   
## [70] "basophil.count.percent" "X2019.nCoV.NAT"         "MCH"                   
## [73] "aPTT"                   "hs.CRP"                 "q.anti.HIV"            
## [76] "serum.sodium"           "thrombocytocrit"        "ESR"                   
## [79] "ALAT"                   "eGFR"                   "creatinine"

All of these attributes are of numeric type. In the summary we could observe that in one column some values are equal to -1. According to the article, this number was selected as a label for missing values. After checking the whole content of that column, we can see that there are only two different values {NA, -1}, which indicates that it’s filled only with missing values, thus should be removed from the dataset.

## [1] "X2019.nCoV.NAT"
## [1] NA -1
df <- df %>% select(-c(colnames(df)[apply(df, 2, function(x) any(x == -1, na.rm = TRUE))]))

Groups detection

The number of lines per patient is clearly inconstant, which suggests that every patient could have undergone various combinations of specific blood tests. Also, it’s common knowledge that one sample of drawn blood is enough to perform multiple tests. We would like to check if some of the blood tests available in our dataset were more likely to be performed together.

Firstly, let’s check how many unique blood tests combinations are present in the dataset.

trans <- unique(apply(!is.na(df[, 8:80]), 1, which))
length(trans)
## [1] 175

Now we will try to find closed frequent itemsets with apriori algorithm. The maximum number of items per itemset is set to 25. The maximum subset checking time is 60 seconds, support = 0.09 and confidence = 0.8.

freq_itemsets <- apriori(trans, parameter = list(support=0.09, maxlen=25, maxtime=60, target="closed frequent itemsets"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##          NA    0.1    1 none FALSE            TRUE      60    0.09      1
##  maxlen                   target  ext
##      25 closed frequent itemsets TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 15 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[73 item(s), 175 transaction(s)] done [0.00s].
## sorting and recoding items ... [59 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 done [28.85s].
## filtering closed item sets ... done [16.54s].
## sorting transactions ... done [0.00s].
## writing ... [162 set(s)] done [0.27s].
## creating S4 object  ... done [0.37s].
max_freq_itemsets <- freq_itemsets[is.maximal(freq_itemsets)]
itemLabels(max_freq_itemsets) <- colnames(df[, 8:80])
inspect(sort(max_freq_itemsets, by='support'))
##      items                       support transIdenticalToItemsets count
## [1]  {hs.cTnI}                0.18285714              0.000000000    32
## [2]  {procalcitonin,                                                   
##       NT.proBNP}              0.14857143              0.000000000    26
## [3]  {q.anti.TP,                                                       
##       HBsAg,                                                           
##       q.anti.HCV,                                                      
##       q.anti.HIV}             0.11428571              0.000000000    20
## [4]  {serum.chloride,                                                  
##       ALP,                                                             
##       albumin,                                                         
##       total.bilirubin,                                                 
##       indirect.bilirubin,                                              
##       total.protein,                                                   
##       urea,                                                            
##       corrected.calcium,                                               
##       serum.potassium,                                                 
##       glucose,                                                         
##       direct.bilirubin,                                                
##       total.cholesterol,                                               
##       AST,                                                             
##       uric.acid,                                                       
##       HCO3,                                                            
##       calcium,                                                         
##       LDH,                                                             
##       globulin,                                                        
##       GGT,                                                             
##       hs.CRP,                                                          
##       serum.sodium,                                                    
##       ALAT,                                                            
##       eGFR,                                                            
##       creatinine}             0.11428571              0.040000000    20
## [5]  {PT,                                                              
##       AT,                                                              
##       prothrombin.activity,                                            
##       fibrinogen,                                                      
##       TT,                                                              
##       D.Dimer,                                                         
##       FDP,                                                             
##       INR,                                                             
##       aPTT}                   0.10857143              0.000000000    19
## [6]  {ferritin,                                                        
##       hs.CRP}                 0.09714286              0.005714286    17
## [7]  {hemoglobin,                                                      
##       eosinophils.percent,                                             
##       basophil.percent,                                                
##       platelet.count,                                                  
##       monocytes.percent,                                               
##       neutrophils.percent,                                             
##       MCV,                                                             
##       hematocrit,                                                      
##       WBC.count,                                                       
##       MCHC,                                                            
##       lymphocyte.count,                                                
##       RBC.count,                                                       
##       eosinophil.count,                                                
##       neutrophils.count,                                               
##       lymphocyte,                                                      
##       monocytes.count,                                                 
##       basophil.count.percent,                                          
##       MCH}                    0.09714286              0.000000000    17
## [8]  {procalcitonin,                                                   
##       glucose}                0.09142857              0.022857143    16
## [9]  {serum.chloride,                                                  
##       PT,                                                              
##       prothrombin.activity,                                            
##       serum.potassium,                                                 
##       calcium,                                                         
##       INR,                                                             
##       serum.sodium}           0.09142857              0.000000000    16
## [10] {ALP,                                                             
##       albumin,                                                         
##       total.bilirubin,                                                 
##       indirect.bilirubin,                                              
##       total.protein,                                                   
##       urea,                                                            
##       direct.bilirubin,                                                
##       ferritin,                                                        
##       total.cholesterol,                                               
##       AST,                                                             
##       uric.acid,                                                       
##       HCO3,                                                            
##       LDH,                                                             
##       globulin,                                                        
##       GGT,                                                             
##       ALAT,                                                            
##       eGFR,                                                            
##       creatinine}             0.09142857              0.000000000    16
## [11] {serum.chloride,                                                  
##       ALP,                                                             
##       albumin,                                                         
##       total.bilirubin,                                                 
##       total.protein,                                                   
##       urea,                                                            
##       corrected.calcium,                                               
##       serum.potassium,                                                 
##       direct.bilirubin,                                                
##       ferritin,                                                        
##       total.cholesterol,                                               
##       AST,                                                             
##       uric.acid,                                                       
##       HCO3,                                                            
##       calcium,                                                         
##       LDH,                                                             
##       globulin,                                                        
##       GGT,                                                             
##       serum.sodium,                                                    
##       ALAT,                                                            
##       eGFR,                                                            
##       creatinine}             0.09142857              0.000000000    16
## [12] {serum.chloride,                                                  
##       procalcitonin,                                                   
##       ALP,                                                             
##       albumin,                                                         
##       total.bilirubin,                                                 
##       total.protein,                                                   
##       urea,                                                            
##       corrected.calcium,                                               
##       serum.potassium,                                                 
##       direct.bilirubin,                                                
##       total.cholesterol,                                               
##       AST,                                                             
##       uric.acid,                                                       
##       HCO3,                                                            
##       calcium,                                                         
##       LDH,                                                             
##       globulin,                                                        
##       GGT,                                                             
##       hs.CRP,                                                          
##       serum.sodium,                                                    
##       ALAT,                                                            
##       eGFR,                                                            
##       creatinine}             0.09142857              0.000000000    16

The algorithm returned 12 closed frequent itemsets. Analyzing the statistics above, we can draw some conclusions:

  1. Itemset 3 contains tests detecting presence of antibodies HIV, HCV, Treponema Pallidum antibodies and HBsAg protein (arises in presence of HBV). The aim of these tests is to determine if the patient is a carrier of any of those pathogens.
  2. Itemsets 4, 11 and 12 differentiate from each other at most with 3 elements, thus their intersection can be treated as a single itemset.
  3. Itemsets 4, 11 and 12 consist mostly of items found in standard Blood Biochemistry tests.
  4. Itemset 7 consists of items found in the standard Complete Blood Count test.
  5. The items in itemset 5 are known as coagulation tests, which measure blood’s ability to clot.
  6. Glucose and procalcitonin levels are often checked together as indicated by itemset 8. Procalcitonin level rises in response to a pro-inflammatory stimulus, especially of bacterial origin. It is therefore often classed as an acute phase reactant.

Final cleanup

For now the percent of NA values in the dataset is too high.

(sum(is.na(df)) / (nrow(df) * ncol(df))) * 100
## [1] 79.9435

To minimize the number of missing values, we will collapse all rows linked to one patient with tests’ results, which were released on the same day. In case of multiple results of the same test during one day, the latest will be saved.

df_clean <- df %>%
  mutate(re.date = floor_date(re.date, unit = "day")) %>%
  group_by(patient.id, re.date) %>%
  summarize(across(age:outcome, last), across(hs.cTnI:creatinine, ~.[last(which(!is.na(.)))])) %>%
  ungroup()

Let’s remove patients who have not been tested for at least 50% of all available tests.

ids <- df_clean %>%
  group_by(patient.id) %>%
  summarize(across(hs.cTnI:creatinine, ~any(!is.na(.)))) %>%
  summarize(patient.id = patient.id, n = rowSums(select(., c(hs.cTnI:creatinine)))) %>%
  ungroup() %>%
  filter(n > ncol(select(df_clean, c(hs.cTnI:creatinine))) * 0.5)

df_clean <- df_clean %>%
  filter(patient.id %in% ids$patient.id)

(sum(is.na(df_clean)) / (nrow(df_clean) * ncol(df_clean))) * 100
## [1] 51.36116

As we can see, the percentage of NA values decreased from approximately 80% to 51%, which is still high, but deleting even more records would significantly reduce the size of the dataset.

Dataset description

Final version of the dataset contains 1698 rows and 80 columns. Each row represents a set of blood tests results, which were performed for a single patient on a given day.

dim(df_clean)
## [1] 1698   80

Detailed description of individual columns of the dataset with some basic statistics are introduced in the codebook below.

Sample

A sample of clean data is presented below.

Dataset analysis

This section is dedicated to attribute values analysis and visualization. It covers not only separate study of each variable but also attempts of finding correlation between them.

Data over time

The following chart presents how many blood tests of a given type were performed during each day from January 10 to February 18, 2020.

At first glance, we observe a significant rise in the total number of tests after January 27, which suggests that from this day forward more people infected with COVID-19 were admitted to the hospital on a daily basis. Another fact is that before that day the variety of blood test types, which were performed, was visibly lower. Throughout the whole considered period, the most frequently carried out exams were tests included in Complete Blood Count, Blood Biochemistry and glucose level categories.

To confirm the suspicion of a growing number of admitted patients starting on January 27, let’s construct a chart visualizing the number of admitted and discharged patients on a given day.

The number of patients admitted to the hospital on a single day peaked on February 10, whereas the highest number of discharged patients was observed on February 18.

Attribute analysis

This section is broke into 3 parts: analyzing data concerning patient’s personal information, blood test results for each type and correlation between them.

Information about a patient

The following overlaying histograms portray age distribution of COVID-19 survivors and victims.

The majority of patients, who died for discussed disease, were people aged 65 years or older. Among younger people there were only 2 lethal victims. Beside those 2 outlier cases, people below 40 years were reported to be cured.

The next figure shows gender and outcome of treatment distributions.

In a set of 357 patients, 210 of them were male and 147 were female. The ratio of the deceased patients to the ones, who were cured, equals to approximately 0.84, which even at the time was alarmingly high.

Let’s calculate the total amount of time spent in the hospital by each patient and group these values by the outcome of treatment.

The figure shows that the mean number of days of medical follow-up for the deceased patients was less than the corresponding value for the survivors. The conclusion is that people who were reported dead were admitted to the hospital mostly in already severe condition, whereas patients who got the treatment in early onset of disease had better chances of getting cured.

Blood tests

Based on the assumption that potential readers of this analysis are not proficient in interpreting blood test results, calculating basic statistics for each column won’t be very helpful in understanding what they mean. To avoid this, the data in each column was grouped by the outcome of treatment. Comparing results of these two classes gives us a useful point of view in deciding, which value ranges can be considered abnormal for a specific variable.

At first glance, we notice graphs with most visible contrasts between ‘survived’ and ‘deceased’ classes, which are for lymphocytes, D-Dimer, LDH, FDP and hs-CRP. These items are present in 3 different maximal frequent itemsets, which were detected earlier and classified as Blood Biochemistry, Total Blood Count and coagulation tests. Trying to analyze the plots by these categories, we observe some interesting trends for patients who died.

Blood Biochemistry:

  • abnormal levels of some of microelements (serum.chloride, serum.sodium, calcium)
  • multiple outliers with increased levels of total bilirubin and indirect bilirubin
  • decreased albumin level
  • significantly elevated level of urea in bloodstream and its acidity
  • numerous outliers with increased value for AST and GGT, which can indicate liver damage
  • notable increase in LDH level suggesting tissue damage in vital organs,
  • serious rise in hs.CRP values, which detects inflammation during infection and risk of heart disease
  • lower level of eGFR and higher numbers for creatinine suggests possible problems with kidneys

These insights suggest possible damage to vital organs and malfunction of the excretory system caused by COVID-19.

Total Blood Count:

  • lower level of eosinophils, percentage of basophils and monocytes, count of platelets and lymphocytes

Coagulation tests:

  • lower level of thrombocytocrit, prothrombin activity and AT
  • significant increase in D-Dimer and FDP level
  • numerous outliers in PT, TT, INR, aPTT

Abnormal coagulation results, in particular for patients who had markedly elevated D-D and FDP, seem to go hand in hand with a higher risk of mortality associated with COVID-19.

The abnormally high result values for hs-cTnI and NT-proBNP were other observed indicators for higher disease severity. This article confirms that statement for serum high-sensitivity cardiac Troponin I, adding that it is true only for patients with no preexisting cardiovascular disease.

Similarly, extremely high levels of procalcitonin are positively associated with the severity of COVID-19 according to this publication. In this case, it can help in distinguishing between severe/critical patients and moderate patients with the disease.

Correlation

In order to check if some of variables concerning blood test results values are correlated, we calculate Pearson correlation coefficient for columns [8:80], obtaining correlation matrix. Beside the value of the correlation coefficient we will take under consideration significance level as well. To do so, we calculate p-values.

The plot below presents a correlogram combined with the significance test sorted by the coefficient value. Correlations with p-value > 5% are considered as insignificant, thus the correlation coefficient values are left blank in that case. The diagonal is omitted for better readability. The correlogram reveals a couple of strongly correlated attribute clusters marked with dark blue color expressing value of correlation close to 1. Most of them are groups of tests performed alternatively or on a similar purpose.

D-Dimer and FDP, which were stated as important indicators in the previous section, are strongly correlated. The same goes for the pair of PT and INR and a bigger cluster consisting of 4 elements: LDH, hs.cTnI, NT-proBNT and ferritin.

Classifier

This section describes the consecutive phases of building a classifier, which predicts whether a patient will be cured or die of COVID-19 based on his or her blood test results. The subsections are focusing on dataset preprocessing, the classifier building and analysis of attribute importance respectively.

Dataset preprocessing

Firstly, the dataset has to be transformed to a form where observations represent patients. To do so, we group clean dataset by patient.id and, in case of many records in one column, choose the most recent result. Patient ID, admission and discharge timestamps, as well as date of results release are no longer needed, thus shall be dropped.

df_ppl <- df_clean %>%
  group_by(patient.id) %>%
  fill(everything()) %>%
  summarise(across(everything(), last)) %>%
  ungroup() %>%
  select(-c(patient.id, re.date, admission.time, discharge.time))

(sum(is.na(df_ppl)) / (nrow(df_ppl) * ncol(df_ppl))) * 100
## [1] 8.79773

The obtained dataset consists in around 9% of missing values. In attribute analysis we determined potentially useful in survival prediction columns. Let’s check how many unique numbers of NAs they contain.

##  [1]   1   2   3   4   5  15  45  59  70  91 155

For columns with less than 16 NAs, the patients with those missing values will be deleted. After that, the remaining columns with any NA values will be dropped.

na_n <- sapply(df_ppl[, cols_all],  function(c) sum(is.na(c), TRUE))
cols <- names(na_n)[na_n < 16]

df_ppl <- df_ppl %>%
  filter(across(cols, ~!is.na(.))) %>%
  select_if(~ !any(is.na(.)))

dim(df_ppl)
## [1] 343  46
colnames(df_ppl)
##  [1] "age"                    "gender"                 "outcome"               
##  [4] "hemoglobin"             "serum.chloride"         "PT"                    
##  [7] "eosinophils.percent"    "ALP"                    "albumin"               
## [10] "basophil.percent"       "total.bilirubin"        "platelet.count"        
## [13] "monocytes.percent"      "indirect.bilirubin"     "neutrophils.percent"   
## [16] "total.protein"          "prothrombin.activity"   "MCV"                   
## [19] "hematocrit"             "WBC.count"              "MCHC"                  
## [22] "urea"                   "lymphocyte.count"       "RBC.count"             
## [25] "eosinophil.count"       "serum.potassium"        "neutrophils.count"     
## [28] "direct.bilirubin"       "lymphocyte"             "total.cholesterol"     
## [31] "AST"                    "uric.acid"              "HCO3"                  
## [34] "calcium"                "LDH"                    "monocytes.count"       
## [37] "globulin"               "GGT"                    "INR"                   
## [40] "basophil.count.percent" "MCH"                    "hs.CRP"                
## [43] "serum.sodium"           "ALAT"                   "eGFR"                  
## [46] "creatinine"

Building classifier

In order to build a classifier, we divide previously prepared dataset into training (75%) and testing (25%) sets. The method used for training is repeated k-cross validation with 2 folds and 5 repeats. The model of choice is Random Forest.

inTraining <- createDataPartition(y = df_ppl$outcome, p = .75, list = FALSE)

training <- df_ppl[inTraining,]
testing  <- df_ppl[-inTraining,]

rfGrid <- expand.grid(mtry = 10:30)
gridCtrl <- trainControl(
    method = "repeatedcv",
    summaryFunction = twoClassSummary,
    classProbs = TRUE,
    number = 2,
    repeats = 5)


fitTune <- train(outcome ~ .,
             data = training,
             method = "rf",
             metric = "ROC",
             preProc = c("center", "scale"),
             trControl = gridCtrl,
             tuneGrid = rfGrid,
             ntree = 30)

fitTune
## Random Forest 
## 
## 258 samples
##  45 predictor
##   2 classes: 'survived', 'deceased' 
## 
## Pre-processing: centered (45), scaled (45) 
## Resampling: Cross-Validated (2 fold, repeated 5 times) 
## Summary of sample sizes: 129, 129, 129, 129, 129, 129, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   10    0.9927119  0.9600000  0.9677966
##   11    0.9920702  0.9671429  0.9677966
##   12    0.9904358  0.9628571  0.9745763
##   13    0.9926998  0.9685714  0.9711864
##   14    0.9931840  0.9614286  0.9627119
##   15    0.9918886  0.9642857  0.9661017
##   16    0.9927119  0.9614286  0.9661017
##   17    0.9918039  0.9657143  0.9694915
##   18    0.9922518  0.9642857  0.9644068
##   19    0.9914407  0.9642857  0.9474576
##   20    0.9921186  0.9614286  0.9593220
##   21    0.9911138  0.9642857  0.9610169
##   22    0.9915133  0.9628571  0.9491525
##   23    0.9901211  0.9585714  0.9525424
##   24    0.9904116  0.9585714  0.9508475
##   25    0.9904843  0.9600000  0.9508475
##   26    0.9917070  0.9685714  0.9508475
##   27    0.9892615  0.9600000  0.9525424
##   28    0.9906659  0.9657143  0.9576271
##   29    0.9907022  0.9614286  0.9525424
##   30    0.9890557  0.9528571  0.9457627
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 14.
ggplot(fitTune)

rfTuneOutcomes <- predict(fitTune, newdata = testing)
confusionMatrix(data = rfTuneOutcomes, testing$outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction survived deceased
##   survived       46        1
##   deceased        0       38
##                                           
##                Accuracy : 0.9882          
##                  95% CI : (0.9362, 0.9997)
##     No Information Rate : 0.5412          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9763          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9744          
##          Pos Pred Value : 0.9787          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5412          
##          Detection Rate : 0.5412          
##    Detection Prevalence : 0.5529          
##       Balanced Accuracy : 0.9872          
##                                           
##        'Positive' Class : survived        
## 

The accuracy of the obtained classifier equals approximately 98%.

Attribute importance

The plot below portrays importance of attributes used in model training. The first 16 variables are included in the set of columns, which we discussed as crucial in the attribute analysis section. What’s surprising, the female gender is quite high in the ranking.

Conclusion

In general, we observe a higher fatality rate of COVID-19 disease in the elderly group and people with preexisted underlying health conditions. People who have been provided with medical care from very onset phases of the disease were more likely to survive. Blood biochemistry and CBC tests proved to be useful in detecting first signs of infection. Based on source data, the most accurate biomarkers in detecting patients with COVID-19 are extremely high levels of hs-cTnI, NT-proBNP and hs-CRP. The high importance of hs-CRP was confirmed by the classifier. The trained model outputted LDH and quantity of various white blood cells as the most crucial biomarkers in detecting presence of coronavirus-19 in the organism.